We discuss, in terms of rate-distortion theory, the fitness of molecular codes as the problem of 
designing an optimal information channel. The fitness is governed by an interplay between the cost 
and quality of the channel, which induces smoothness in the code. By incorporating this code fitness 
into population dynamics models, we suggest that the emergence and evolution of molecular codes 
may be explained by simple channel design considerations. 

PACS numbers: 87.10.+e, 87.14.Gg, 87.14.Ee
Living systems store information in one form of a molecule (e.g. DNA) and use it to produce
a different molecule (e.g. protein), usually relying on intermediary recognition processes by
other molecules (e.g. tRNA). This information transfer is a code, albeit one that must perform
in a noisy environment. Such noisy information channels are omnipresent in biology, and
analyzing them can highlight the engineering constraints on living systems and their impact on
fitness. Several studies have examined the biophysical makeup of the transcription regulatory
network (TRN) to scrutinize the effect of this information system on fitness [1-3].

In this work, we first introduce a measure for the fitness of molecular codes. The quality of
the code is measured by the distortion of a typical message. The cost is the typical number of
bits required to write one message. The overall fitness of the code is the weighted sum of cost
and quality. This is similar to the basic problem of rate-distortion theory [4, 5]: how to
design an optimal information channel by balancing the cost against the required transmission
quality. We find that the relevant control parameter is the derivative of cost with respect to
quality, termed the gain. The code appears at a phase transition in the information channel
[5-11], with the gain playing the role of an inverse temperature. To examine the appearance of
codes we then turn to models describing populations of information-processing systems,
simplified "organisms", which compete and evolve according to the fitness of their codes. We
show that the coding transition can be induced by changing a number of parameters, such as the
accuracy of reading and the population size. Finally, we treat two realistic scenarios of
deviations from the simplified ideal dynamics, which involve mutations and genetic drift (i.e.
reproduction fluctuations). Mutations broaden the population to include systems with lower code
fitness, creating a "quasi-species" with a reduced effective fitness. Genetic drift delays the
coding transition to higher gains.

The fitness of molecular codes. - Molecular codes often relate two sets of molecules, which we
may think of as symbols and their potential meanings. In the genetic code, for example, the
symbols are the 64 DNA base-triplets (codons) and their meanings are the 20 amino-acids and the
stop signal [12]. In the case of the TRN, DNA sites are the symbols and the meanings are the
transcription factors that bind the sites. Optimizing the quality and cost of a biological code
can therefore be regarded as a semantic problem of wisely assigning meanings to symbols. To
discuss this semantic problem, we consider an information channel that relates two spaces, one
with $s$ symbols and the other with $m$ meanings (Fig. 1). The channel describes how meanings
are stored in memory as molecular symbols, and how the symbols are read to reconstruct the
meaning.
FIG. 1: Molecular codes as noisy information channels. The code relates the space of meanings
(left) with that of symbols (right). The channel is a three-stage Markov process (blue arrows):
(i) A meaning $\alpha$ is encoded as a symbol $i$ by the encoder $e_{\alpha i}$. (ii) $i$ is
read as $j$ by the reader $r_{ij}$. (iii) $j$ is decoded as $\omega$ by the decoder
$d_{j\omega}$. The distance between the original and the reconstructed meanings is
$c_{\alpha\omega}$ (red arrow).



Molecular codes rely on error-prone binding and the channel is therefore described by a
three-stage stochastic process [a, LD, la, [lfj: (i) The storage of meanings in memory as
symbols is represented by an encoder matrix $e_{\alpha i}$, the probability that a meaning
$\alpha$ is encoded by a symbol $i$. (ii) The symbol is read as described by the reader matrix
$r_{ij}$, the probability to read the symbol $i$ as $j$, which accounts for possible misreading
errors. (iii) Finally, the read symbol $j$ is interpreted as carrying a meaning $\omega$
according to a decoder matrix $d_{j\omega}$. The distortion between the original meaning
$\alpha$ and the reconstructed one $\omega$ is measured by the distance $c_{\alpha\omega}$. In
the genetic code, for example, amino-acid meanings are encoded as base-triplet symbols, which in
turn are read by tRNAs at the ribosome. Finally, the decoded amino-acid carried by the tRNA is
ligated to the synthesized protein.

To estimate the quality of the coding system one examines how well a reconstructed meaning
preserves the original one. This is measured by the average distortion $D$ [4, 5] along all
possible paths $\alpha \to i \to j \to \omega$ between original and reconstructed meanings
[7|, la, [l0(. The demand for each meaning $\alpha$ is $f_\alpha$, which accounts for the
possibility that some meanings are used more frequently than others. To calculate $D$, each
path is weighted by its probability, $f_\alpha e_{\alpha i} r_{ij} d_{j\omega}$, and the
summation yields:

$$D = \sum_{\alpha,i,j,\omega} f_\alpha\, e_{\alpha i}\, r_{ij}\, d_{j\omega}\, c_{\alpha\omega}. \qquad (1)$$
The reader $r_{ij}$ may be represented as a graph, in which the nodes are the symbols and edges
connect symbols that are likely to be confused (e.g. Fig. 2C) (Hill. If the reader were ideal
($r_{ij} = \delta_{ij}$), then it would have been advantageous to decode as many meanings as
there are available symbols. However, since the molecular reader is not perfect, it is
preferable to decode fewer meanings and thereby minimize the effect of misreading errors.
Moreover, the preferable codes are smooth, that is, symbols that are likely to be confused
encode the same meaning or meanings that are close with respect to the distance
$c_{\alpha\omega}$ [12-15] (Fig. 2D).

A common measure for the cost of a coding system is the mutual information $I$, which estimates
the average number of bits required to encode one meaning [5],

$$I = \sum_{\alpha,i} f_\alpha\, e_{\alpha i} \ln\frac{e_{\alpha i}}{u_i}, \qquad (2)$$

where $u_i = \sum_\alpha f_\alpha e_{\alpha i}$ is the overall probability to use the symbol
$i$. In molecular codes $I$ is directly linked to fitness: The encoder $e_{\alpha i}$ is the
probability that the molecule carrying the meaning $\alpha$ binds the molecular symbol $i$. In
the TRN, for example, $\alpha$ is a transcription factor and $i$ is a prospective DNA binding
site. The binding probability $e_{\alpha i}$ scales like a Boltzmann exponent,
$e_{\alpha i} \sim \exp\varepsilon_{\alpha i}$, where the binding energy
$\varepsilon_{\alpha i}$ is in $k_B T$ units. It follows that $I$ is actually the average
binding energy, $I = \sum_{\alpha,i} f_\alpha e_{\alpha i} (\varepsilon_{\alpha i} - \varepsilon_i) = \langle \varepsilon_{\alpha i} - \varepsilon_i \rangle$,
with the reference energies $\varepsilon_i = \ln \sum_\beta f_\beta \exp\varepsilon_{\beta i}$.
In several molecular codes (e.g. TRN) the binding energies $\varepsilon_{\alpha i}$ are
approximately linear in the size of the binding sites. The cost $I$ is therefore proportional to
the average size of the binding site [1-3]. The evolutionary cost to replicate, transcribe and
translate the gene that encodes the binding site, and the cost to correct mutations in this
gene, are all expected to be linear in the binding site size [la ]. $I$ is therefore
proportional to the actual fitness cost.

To optimize the molecular coding apparatus, its cost and distortion must be balanced. We
describe this interplay as the maximization of an overall code fitness,
$H = -D - \kappa^{-1} I$. While $I$ and $D$ are to be minimized, the fitness $H$ is driven by
evolution towards maxima, as manifested by the minus signs in $H$. The gain
$\kappa \sim dI/dD$ measures the bits of information required to increase the quality. The gain
$\kappa$ is expected to increase with the complexity of the organism and its environment: The
circuitry of a complex organism transmits more signals and reads a larger genome. It is
therefore beneficial for this organism to pay a larger cost to improve the quality of its code,
since it gains more from such an improvement. Similarly, the gain is larger in a richer
environment.

Population dynamics in the code space. - To examine how codes evolve in response to changes in
the gain, we consider a population of simplified "organisms" that compete according to the
fitness of their codes. We imagine a scenario where, for a given demand $f_\alpha$ determined by
the environment and a given reader $r_{ij}$, each "organism" has a code specified by its encoder
$e_{\alpha i}$ and decoder $d_{j\omega}$. The optimal encoder and decoder are related through
Bayes' theorem [7-10],
$d_{j\omega} \sum_{\beta,i} f_\beta e_{\beta i} r_{ij} = f_\omega \sum_i e_{\omega i} r_{ij}$,
which states the intuitive notion that if an encoded meaning $\omega$ tends to be read as the
symbol $j$ then it is likely that $j$ is decoded as $\omega$ [17]. Therefore, it suffices to
identify every organism by its encoder $e_{\alpha i}$, and one may describe the population as
points in a "code space", which is spanned by all possible encoders (Fig. 2A). This space is an
$m \times s$-dimensional unit cube, $0 \le e_{\alpha i} \le 1$, and each axis corresponds to an
entry of the encoder $e_{\alpha i}$. An organism is represented by a point in the cube and the
population is a "cloud" of such points of probability density $\Psi(e_{\alpha i})$. Since the
encoder obeys the $m$ conservation relations $\sum_i e_{\alpha i} = 1$, the effective dimension
is reduced to $m \times (s - 1)$. In the following, we treat three limiting cases of population
dynamics in the code space: first, a large population with negligible mutation rate; next, a
large population with significant mutation rate; and, finally, a smaller population with
considerable genetic drift.
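The quantities defined so far (Bayes decoder, distortion $D$, cost $I$ and fitness $H$) can be evaluated directly for small codes. The following is a minimal sketch, not the authors' implementation: the function names are hypothetical, the cost is taken in natural-log units (an assumption, chosen to match the Boltzmann relation above), and the reader is assumed to have no symbol that is never read, so the Bayes normalization does not vanish.

```python
import math

def bayes_decoder(f, e, r):
    """Decoder d[j][w] from Bayes' theorem: d_jw is proportional to f_w * sum_i e_wi r_ij."""
    m, s = len(e), len(r)
    # joint[w][j] = f_w * P(symbol j is read | meaning w was encoded)
    joint = [[f[w] * sum(e[w][i] * r[i][j] for i in range(s)) for j in range(s)]
             for w in range(m)]
    d = []
    for j in range(s):
        norm = sum(joint[w][j] for w in range(m))  # assumed nonzero
        d.append([joint[w][j] / norm for w in range(m)])
    return d

def distortion(f, e, r, d, c):
    """Average distortion D: sum over all paths meaning -> symbol -> read symbol -> meaning."""
    m, s = len(e), len(r)
    return sum(f[a] * e[a][i] * r[i][j] * d[j][w] * c[a][w]
               for a in range(m) for i in range(s)
               for j in range(s) for w in range(m))

def cost(f, e):
    """Mutual information I between meanings and symbols (natural log, assumed)."""
    m, s = len(e), len(e[0])
    u = [sum(f[a] * e[a][i] for a in range(m)) for i in range(s)]
    return sum(f[a] * e[a][i] * math.log(e[a][i] / u[i])
               for a in range(m) for i in range(s) if e[a][i] > 0.0)

def fitness(f, e, r, c, kappa):
    """Code fitness H = -D - I / kappa, with the decoder fixed by Bayes' theorem."""
    d = bayes_decoder(f, e, r)
    return -distortion(f, e, r, d, c) - cost(f, e) / kappa

# Illustrative two-meaning, two-symbol channel (values assumed for the example).
f = [0.5, 0.5]
r = [[0.9, 0.1], [0.1, 0.9]]          # reader with misreading probability 0.1
c = [[0.0, 1.0], [1.0, 0.0]]          # distance between distinct meanings set to 1
e_noncoding = [[0.5, 0.5], [0.5, 0.5]]
e_one_to_one = [[1.0, 0.0], [0.0, 1.0]]
```

For the non-coding encoder the cost vanishes and the distortion is 1/2; for the one-to-one encoder the cost rises to ln 2 while the distortion drops to 0.18, which is the trade-off that the gain $\kappa$ arbitrates.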

The coding transition. - To find the coding transition we first look at the simplified case of
large populations with negligible mutation rate. These tend to peak at an optimal value of the
encoder $e^*_{\alpha i}$ and therefore may be approximated by a delta function,
$\Psi(e_{\alpha i}) = \delta(e_{\alpha i} - e^*_{\alpha i})$. As a result, the dynamics in this
regime amounts to tracing the evolution of the optimal code as the gain $\kappa$ changes. The
optimal code is found at the extremum, $\partial H/\partial e_{\alpha i} = 0$ [17], which leads to
In this Boltzmann partition, the effective energies are
… $\sum_\gamma d_{j\gamma} c_{\omega\gamma})$ and the gain $\kappa$ plays the role of an inverse
temperature, i.e. organisms with lower $\kappa$ are "hotter" and their codes are noisier.

A simple example for the evolution of a code with increasing gain $\kappa$ is graphed in Fig.
2A-B. At low $\kappa$, the optimal encoder is $e_{\alpha i} = u_i$ and $I$ vanishes, since the
encoder is $\alpha$-independent and therefore conveys no information about the meanings (Eq. 2).
For this reason this state is termed non-coding. To pinpoint the transition, we examine the
stability of the fitness with respect to small variations of the encoder,
$\delta e_{\alpha i} = e_{\alpha i} - u_i$. The variation $\delta e_{\alpha i}$ is the order
parameter that describes the emergence of a coding state, $\delta e_{\alpha i} \neq 0$, with
correlated meanings and symbols. We find that the coding/non-coding transition takes place at a
critical gain $\kappa_c$ [17],

$$\kappa_c = \frac{1}{2 \lambda_R^* \lambda_C^*}, \qquad (4)$$

where $\lambda_C^*$ is the maximal eigenvalue of the normalized distance $C$ and $\lambda_R^*$
is the maximal eigenvalue of the weighted square of the reader,
$R_{ij} = \sqrt{u_i u_j} \sum_k \left( r_{ik} r_{kj} / \sum_t u_t r_{tk} \right)$ [17].
$\lambda_R^*$ corresponds to the smoothest non-uniform eigenvector
$\delta e^*_{\alpha i} \neq 0$, which represents a coding state [TJ, lla, llSj. This
eigenvector, which emerges at the coding transition (Fig. 2B-D), is the first-excited state of
the system and measures the tendency of a meaning $\alpha$ to be encoded by the symbol $i$.
Boltzmann partitions and consequent phase transitions are common in rate-distortion theory and
analogous optimization problems in the context of clustering, deterministic annealing and
self-organizing maps [4-11].

The critical gain (Eq. 4) indicates three possible pathways from the random, non-coding state
towards the emergence of a code: via increasing the gain $\kappa$, via increasing the reading
accuracy (larger $\lambda_R^*$) or via increasing the average distance between meanings (larger
$\lambda_C^*$). We suggest that such simple coding/non-coding transitions may describe the
emergence of biological codes. In the case of the TRN, for example, one imagines the primordial
circumstances when a primitive organism had only one universal transcription factor that binds
all DNA sites (Fig. 2F). Then, as $\kappa$ increases, for example when the environment becomes
richer in information, the factor splits into several distinct factors, each binding to specific
sites. In the case of the genetic code, a series of transitions (like those in Fig. 2D) is
thought to describe the emergence and evolution of the code 12j, [la, LL8J.
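The coding transition of the two-meaning, two-symbol example (Fig. 2A, misreading probability 0.1) can be located numerically. The sketch below rests on several assumptions that are not stated in the text: equal demands $f_\alpha = 1/2$, a distance of 1 between the two meanings, the cost measured in natural-log units, and a search restricted to symmetric encoders $e_{11} = e_{22} = x$. Under these assumptions the Bayes decoder gives $D = 2q(1-q)$ with $q = \epsilon + (1 - 2\epsilon)x$, and $I = \ln 2 - S(x)$, where $S$ is the binary entropy. The function names are hypothetical.

```python
import math

EPS = 0.1  # misreading probability of the reader quoted for Fig. 2A

def fitness_symmetric(x, kappa, eps=EPS):
    """H = -D - I/kappa for the encoder e11 = e22 = x, e12 = e21 = 1 - x."""
    q = eps + (1.0 - 2.0 * eps) * x           # P(symbol 1 is read | meaning 1)
    distortion = 2.0 * q * (1.0 - q)          # Bayes decoder, unit distance (assumed)
    entropy = -sum(p * math.log(p) for p in (x, 1.0 - x) if p > 0.0)
    cost = math.log(2.0) - entropy            # mutual information in nats (assumed)
    return -distortion - cost / kappa

def optimal_x(kappa, n_grid=2001):
    """Grid search for the fittest encoder entry x in [0.5, 1]."""
    grid = [0.5 + 0.5 * k / (n_grid - 1) for k in range(n_grid)]
    return max(grid, key=lambda x: fitness_symmetric(x, kappa))

def critical_gain(lo=1.0, hi=3.0, tol=1e-4):
    """Bisect for the gain at which the optimum first leaves the non-coding x = 1/2."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if optimal_x(mid) > 0.5:
            hi = mid      # a coding state already exists at this gain
        else:
            lo = mid      # still non-coding
    return 0.5 * (lo + hi)
```

Expanding this $H$ to second order in $x - 1/2$ gives a critical gain of $1/(1 - 2\epsilon)^2 = 1.5625$, in line with the value 1.56 quoted for Fig. 2A; `critical_gain()` should land close to that number, with the optimum staying at $x = 1/2$ below it and moving continuously away above it.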

FIG. 2: Emergence and evolution of molecular codes. (A) A code relates $m = 2$ meanings {1, 2}
and $s = 2$ symbols {1, 2}. The encoder $e_{\alpha i}$ has 4 entries and is constrained to a 2D
square by the 2 conservation relations $e_{11} + e_{12} = e_{21} + e_{22} = 1$. The reader is
$r_{ij} = (1 - 2\epsilon)\delta_{ij} + \epsilon$ with the misreading probability
$\epsilon = 0.1$. At low mutation rates the population peaks at the optimal encoder (illustrated
as sharp peaks). Below the critical gain $\kappa_c = 1.56$ the state is non-coding with
$e_{\alpha i} = 1/2$. Above $\kappa_c$ a coding state evolves. (B) The channel cost $I$
increases from 0 at the coding transition, $\kappa = \kappa_c$, while the distortion $D$
decreases. The average order parameter $\langle |\delta e| \rangle$ increases continuously from
0 at the second-order coding transition. The fitness $H$ (plotted is $-H$) increases to an
asymptotic value. (C) A code that relates 8 meanings and 6 symbols. The distance is
$c_{\alpha\omega} = \min(|\alpha - \omega|, 8 - |\alpha - \omega|)$ and the reader is defined by
a probability of 0.98 that $i$ is read as $i$ and 0.01 that it is read as one of its two
neighbors on the symbol graph. (D) The optimal encoder $e_{\alpha i}$ is plotted as color-coded
6x8 arrays at increasing gains $\kappa$. Below $\kappa_c = 0.52$ (top left) the encoder is
$e_{\alpha i} = 1/6$ with uncorrelated symbols and meanings. A coding state emerges at
$\kappa_c$. The symbol-meaning correlation increases with $\kappa$ until every meaning $\alpha$
is encoded by exactly one symbol $i$ (bottom right). The optimal code is smooth, i.e. close
meanings are encoded by close symbols, as manifested by the continuous diagonal shape of the
encoder. (E) Quasi-species dynamics of the code from A with mutation rate $\mu$ = 5 • 10 _J.
Below $\kappa_c = 1.56$, the population distribution $\Psi$ is smeared around the non-coding
optimum. Above $\kappa_c$, a coding state appears, $\Psi$ sharpens and migrates towards the
one-to-one code $e_{11} = e_{22} = 1$. (F) A coding transition in the TRN, in which a universal
transcription factor splits into distinct species as the gain $\kappa$ increases.

Effects of mutations. - Mutations add another kind of noise, smearing the population over a
larger region of the code space. When the mutation rate $\mu$ is significant one may model the
population in terms of reaction-diffusion dynamics, in the spirit of the quasi-species model [19],

$$\frac{\partial \Psi}{\partial t} = (H - \bar{H})\,\Psi + \mu \nabla^2 \Psi. \qquad (5)$$

In Eq. 5 each organism in the population reproduces at a rate equal to the fitness of its code
$H(e_{\alpha i})$ (the reaction term). However, codes may mutate at a rate $\mu$. This random
walk in the code space is described by the diffusion term. The fitness $H$ is normalized by the
average fitness $\bar{H} = \int \Psi(e_{\alpha i}) H(e_{\alpha i})\, de_{\alpha i}$ to ensure
conservation of the probability distribution. Typically, $\Psi$ approaches a steady-state, which
corresponds to the eigenmode of maximal $\bar{H}$ [19]. To find the steady-state we approximate
the fitness by a quadratic expansion around an optimum,
$H \simeq H^* - \frac{1}{2} \sum_{\alpha,i,\omega,j} Q_{\alpha i \omega j}\, \delta e_{\alpha i}\, \delta e_{\omega j}$.
Assuming a Gaussian ansatz for $\Psi$ we find the steady-state [17],
$\Psi \sim \exp[-(8\mu)^{-1/2} \sum_{\alpha,i,\omega,j} (\sqrt{Q})_{\alpha i \omega j}\, \delta e_{\alpha i}\, \delta e_{\omega j}]$,
where $\sqrt{Q}$ is the square root of the Hessian $Q_{\alpha i \omega j}$. This indicates that
the mutations smear the population over a width that increases with the mutation rate as
$\sim \mu^{1/4}$ (Fig. 2E), which may be significant even for relatively low mutation rates due
to the small exponent. The leakage by mutations from the optimal code to lesser codes reduces
the average fitness, $\bar{H} = H^* - (\mu/2)^{1/2}\, \mathrm{Tr}\sqrt{Q}$ [17]. At the coding
transition (Eq. 4), the Gaussian $\Psi$ becomes infinitely wide in the direction of the emergent
coding eigenvector $\delta e^*_{\alpha i}$, a precursor of the appearance of a coding state
along this direction.
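The $\mu^{1/4}$ width and the fitness reduction quoted above can be checked in one dimension. The derivation below is a sketch under stated assumptions: the quasi-species equation is taken in the reaction-diffusion form $\partial_t \Psi = (H - \bar{H})\Psi + \mu \partial_x^2 \Psi$, the code space is reduced to a single coordinate $x = \delta e$, and the Hessian reduces to a scalar $Q$. The width $\sigma$ is a symbol introduced here for illustration.

```latex
% One-dimensional steady state with H(x) = H^* - \tfrac{1}{2} Q x^2.
\begin{align}
  0 &= \bigl(H^* - \tfrac{1}{2} Q x^2 - \bar{H}\bigr)\Psi + \mu\,\partial_x^2 \Psi,
      \qquad \Psi \propto \exp\!\bigl(-x^2 / 2\sigma^2\bigr), \\
  \partial_x^2 \Psi &= \Bigl(\frac{x^2}{\sigma^4} - \frac{1}{\sigma^2}\Bigr)\Psi
  \;\Longrightarrow\;
  \frac{\mu}{\sigma^4} = \frac{Q}{2}, \qquad \bar{H} = H^* - \frac{\mu}{\sigma^2}, \\
  \sigma^2 &= \Bigl(\frac{2\mu}{Q}\Bigr)^{1/2}, \qquad
  \sigma \propto \mu^{1/4}, \qquad
  \bar{H} = H^* - \Bigl(\frac{\mu Q}{2}\Bigr)^{1/2}.
\end{align}
```

Matching the $x^2$ terms fixes the width, and matching the constant terms fixes the average fitness. The resulting exponent, $x^2/2\sigma^2 = (8\mu)^{-1/2}\sqrt{Q}\,x^2$, and the reduction $(\mu/2)^{1/2}\sqrt{Q}$ agree with the multidimensional expressions in the text. As $Q \to 0$ along the emergent coding direction, $\sigma$ diverges, which is the infinite broadening at the transition.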

Effects of genetic drift. - The quasi-species dynamics is deterministic in the sense that it
neglects random reproduction fluctuations, termed genetic drift, which are irrelevant in large
populations. However, when the effective size of the population $n$ is small, $n\mu < 1$
(considered to be the relevant condition during the emergence of the genetic code, for example),
genetic drift is a major determinant. The typical dynamics in this regime exhibits long periods
of time when the population resides in the vicinity of a fitness optimum, separated by short
transients of diffusion by genetic drift to another optimum. For our purpose, it is convenient
to coarse-grain this dynamics in space and time and regard it as instantaneous random
transitions between the optima. In this type of dynamics the distribution $\Psi$ approaches
asymptotically a Boltzmann-like distribution with an "inverse temperature" that is equal to the
population size $n$, up to a factor of order unity [3, 20]. It is convenient to define a free
energy [20], $F = \langle -H \rangle - n^{-1} S$, in which the fitness is minus the Hamiltonian
and the entropy of the genetic drift is $S = -\int \Psi \ln \Psi\, de_{\alpha i}$. A mean-field
treatment yields the approximation (akin to a mean-field Potts model) [17],

where $\bar{e}_{\alpha i}$ is the average encoder,
$\bar{e}_{\alpha i} = \int e_{\alpha i} \Psi(e_{\alpha i})\, de_{\alpha i}$. Eq. 6 indicates
that the genetic drift contribution $S$ adds another source of randomness to that of the cost
$I$; both drive the system towards the random non-coding state. From stability analysis of $F$
it follows that the genetic drift shifts the critical transition to higher gains,
$\kappa^{-1} + n^{-1} = 2 \lambda_R^* \lambda_C^*$ [17]. This also adds a fourth pathway towards
the coding transition, via population growth, to the three pathways suggested by Eq. 4. To give
an order-of-magnitude estimate for $\kappa_c$ and $n_c$, we notice that if the misreading
probability is relatively small (the non-diagonal terms $R_{ij} \ll 1$) then
$\lambda_R^* \sim 1$. It follows that the smaller of $\kappa_c$ and $n_c$ is of the order of
$1/\lambda_C^*$, which roughly scales like 1/(fitness reduction by one reading error). Such
cost-quality considerations are generic IB) 12JJ and may help to understand the evolution of
other biological information-processing systems.
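The shift of the transition by genetic drift can be made concrete with a few lines of code. This is a sketch that assumes the drift-corrected condition reads $\kappa_c^{-1} + n^{-1} = 2\lambda_R^*\lambda_C^*$, as inferred from the stability analysis above; the function name is hypothetical, and the eigenvalues used in the example ($\lambda_R^* = 0.64$, $\lambda_C^* = 0.5$) are illustrative values chosen to reproduce $\kappa_c \approx 1.56$ in the drift-free limit.

```python
def critical_gain_with_drift(lambda_r, lambda_c, n=None):
    """Critical gain from 1/kappa_c + 1/n = 2 * lambda_r * lambda_c.

    n is the effective population size; n=None means an infinite population
    (no genetic drift). Returns infinity when the population is too small
    for a coding state to appear at any gain under the assumed relation.
    """
    inv_kappa = 2.0 * lambda_r * lambda_c - (0.0 if n is None else 1.0 / n)
    if inv_kappa <= 0.0:
        return float("inf")
    return 1.0 / inv_kappa

# Illustrative eigenvalues (assumed): drift-free critical gain of about 1.56.
kappa_no_drift = critical_gain_with_drift(0.64, 0.5)
kappa_n10 = critical_gain_with_drift(0.64, 0.5, n=10)   # smaller population
kappa_n1 = critical_gain_with_drift(0.64, 0.5, n=1)     # below critical size
```

With these numbers the critical gain rises from about 1.56 to about 1.85 for $n = 10$, and no coding state is reachable once $n$ falls below $1/(2\lambda_R^*\lambda_C^*)$, which is the population-growth pathway described in the text.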